Bike sharing systems generate strong daily and seasonal patterns that are closely tied to weather and commuting behaviour.
The goal of this project is to predict hourly bike rental demand (count) and to understand which factors matter most for usage.
Accurate forecasts can help operators plan fleet allocation, staff shifts, and maintenance more efficiently.
I work with the Kaggle “Bike Sharing Demand” dataset (2011–2012), focusing on hourly demand, weather, and calendar effects.
The raw training data contain an hourly timestamp (datetime); categorical indicators for season, holiday, workingday, and weather; continuous weather variables temp, atemp, humidity, and windspeed; and three demand variables: casual, registered, and their sum count, which is the target of interest.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(visdat)
library(skimr)
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
##
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
##
## The following object is masked from 'package:datasets':
##
## sleep
library(ggplot2)
library(dplyr)
library(recipes)
##
## Attaching package: 'recipes'
##
## The following object is masked from 'package:VIM':
##
## prepare
##
## The following object is masked from 'package:stringr':
##
## fixed
##
## The following object is masked from 'package:stats':
##
## step
library(readr)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom 1.0.8 ✔ rsample 1.3.0
## ✔ dials 1.4.0 ✔ tune 1.3.0
## ✔ infer 1.0.8 ✔ workflows 1.2.0
## ✔ modeldata 1.4.0 ✔ workflowsets 1.1.1
## ✔ parsnip 1.3.2 ✔ yardstick 1.3.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::prepare() masks VIM::prepare()
## ✖ car::recode() masks dplyr::recode()
## ✖ car::some() masks purrr::some()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(Metrics)
##
## Attaching package: 'Metrics'
##
## The following objects are masked from 'package:yardstick':
##
## accuracy, mae, mape, mase, precision, recall, rmse, smape
library(patchwork)
train <- read_csv('/Users/sam/bikesharingdemand/data/train.csv')
## Rows: 10886 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (11): season, holiday, workingday, weather, temp, atemp, humidity, wind...
## dttm (1): datetime
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The variables season, holiday, workingday, and weather are stored as numbers but are categorical, so they will be converted to factors.
str(train)
## spc_tbl_ [10,886 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ datetime : POSIXct[1:10886], format: "2011-01-01 00:00:00" "2011-01-01 01:00:00" ...
## $ season : num [1:10886] 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : num [1:10886] 0 0 0 0 0 0 0 0 0 0 ...
## $ workingday: num [1:10886] 0 0 0 0 0 0 0 0 0 0 ...
## $ weather : num [1:10886] 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num [1:10886] 9.84 9.02 9.02 9.84 9.84 ...
## $ atemp : num [1:10886] 14.4 13.6 13.6 14.4 14.4 ...
## $ humidity : num [1:10886] 81 80 80 75 75 75 80 86 75 76 ...
## $ windspeed : num [1:10886] 0 0 0 0 0 ...
## $ casual : num [1:10886] 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: num [1:10886] 13 32 27 10 1 1 0 2 7 6 ...
## $ count : num [1:10886] 16 40 32 13 1 1 2 3 8 14 ...
## - attr(*, "spec")=
## .. cols(
## .. datetime = col_datetime(format = ""),
## .. season = col_double(),
## .. holiday = col_double(),
## .. workingday = col_double(),
## .. weather = col_double(),
## .. temp = col_double(),
## .. atemp = col_double(),
## .. humidity = col_double(),
## .. windspeed = col_double(),
## .. casual = col_double(),
## .. registered = col_double(),
## .. count = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
summary(train)
## datetime season holiday
## Min. :2011-01-01 00:00:00 Min. :1.000 Min. :0.00000
## 1st Qu.:2011-07-02 07:15:00 1st Qu.:2.000 1st Qu.:0.00000
## Median :2012-01-01 20:30:00 Median :3.000 Median :0.00000
## Mean :2011-12-27 05:56:22 Mean :2.507 Mean :0.02857
## 3rd Qu.:2012-07-01 12:45:00 3rd Qu.:4.000 3rd Qu.:0.00000
## Max. :2012-12-19 23:00:00 Max. :4.000 Max. :1.00000
## workingday weather temp atemp
## Min. :0.0000 Min. :1.000 Min. : 0.82 Min. : 0.76
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:13.94 1st Qu.:16.66
## Median :1.0000 Median :1.000 Median :20.50 Median :24.24
## Mean :0.6809 Mean :1.418 Mean :20.23 Mean :23.66
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:26.24 3rd Qu.:31.06
## Max. :1.0000 Max. :4.000 Max. :41.00 Max. :45.45
## humidity windspeed casual registered
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 0.0
## 1st Qu.: 47.00 1st Qu.: 7.002 1st Qu.: 4.00 1st Qu.: 36.0
## Median : 62.00 Median :12.998 Median : 17.00 Median :118.0
## Mean : 61.89 Mean :12.799 Mean : 36.02 Mean :155.6
## 3rd Qu.: 77.00 3rd Qu.:16.998 3rd Qu.: 49.00 3rd Qu.:222.0
## Max. :100.00 Max. :56.997 Max. :367.00 Max. :886.0
## count
## Min. : 1.0
## 1st Qu.: 42.0
## Median :145.0
## Mean :191.6
## 3rd Qu.:284.0
## Max. :977.0
skim(train)
| Name | train |
| Number of rows | 10886 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| numeric | 11 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| season | 0 | 1 | 2.51 | 1.12 | 1.00 | 2.00 | 3.00 | 4.00 | 4.00 | ▇▇▁▇▇ |
| holiday | 0 | 1 | 0.03 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| workingday | 0 | 1 | 0.68 | 0.47 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▃▁▁▁▇ |
| weather | 0 | 1 | 1.42 | 0.63 | 1.00 | 1.00 | 1.00 | 2.00 | 4.00 | ▇▃▁▁▁ |
| temp | 0 | 1 | 20.23 | 7.79 | 0.82 | 13.94 | 20.50 | 26.24 | 41.00 | ▂▇▇▇▁ |
| atemp | 0 | 1 | 23.66 | 8.47 | 0.76 | 16.66 | 24.24 | 31.06 | 45.45 | ▁▆▇▇▁ |
| humidity | 0 | 1 | 61.89 | 19.25 | 0.00 | 47.00 | 62.00 | 77.00 | 100.00 | ▁▃▇▇▅ |
| windspeed | 0 | 1 | 12.80 | 8.16 | 0.00 | 7.00 | 13.00 | 17.00 | 57.00 | ▇▆▂▁▁ |
| casual | 0 | 1 | 36.02 | 49.96 | 0.00 | 4.00 | 17.00 | 49.00 | 367.00 | ▇▁▁▁▁ |
| registered | 0 | 1 | 155.55 | 151.04 | 0.00 | 36.00 | 118.00 | 222.00 | 886.00 | ▇▃▁▁▁ |
| count | 0 | 1 | 191.57 | 181.14 | 1.00 | 42.00 | 145.00 | 284.00 | 977.00 | ▇▃▁▁▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| datetime | 0 | 1 | 2011-01-01 | 2012-12-19 23:00:00 | 2012-01-01 20:30:00 | 10886 |
vis_miss(train)
train <- train %>%
mutate(
year = year(datetime),
month = month(datetime),
day = day(datetime),
hour = hour(datetime),
wday_num = wday(datetime),
wday = wday(datetime, label = TRUE),
log_count = log(count + 1)
)
train <- train %>%
mutate(
season = factor(season,
levels = c(1,2,3,4),
labels = c("Winter","Spring","Summer","Fall")),
holiday = factor(holiday, levels = c(0,1), labels=c("No","Yes")),
workingday = factor(workingday, levels = c(0,1), labels=c("No","Yes")),
weather = factor(weather,
levels = c(1,2,3,4),
labels = c("Clear","Mist","Light Snow/Rain","Heavy Rain/Snow")),
year = factor(year),
month = factor(month, levels = 1:12,
labels = c("Jan","Feb","Mar","Apr","May","Jun",
"Jul","Aug","Sep","Oct","Nov","Dec")),
day = factor(day, levels = 1:31),
hour = factor(hour, levels = 0:23),
wday_num = factor(wday_num, levels = 1:7,
labels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat")),
wday = factor(wday,
levels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"))
)
str(train)
## tibble [10,886 × 19] (S3: tbl_df/tbl/data.frame)
## $ datetime : POSIXct[1:10886], format: "2011-01-01 00:00:00" "2011-01-01 01:00:00" ...
## $ season : Factor w/ 4 levels "Winter","Spring",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ workingday: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ weather : Factor w/ 4 levels "Clear","Mist",..: 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num [1:10886] 9.84 9.02 9.02 9.84 9.84 ...
## $ atemp : num [1:10886] 14.4 13.6 13.6 14.4 14.4 ...
## $ humidity : num [1:10886] 81 80 80 75 75 75 80 86 75 76 ...
## $ windspeed : num [1:10886] 0 0 0 0 0 ...
## $ casual : num [1:10886] 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: num [1:10886] 13 32 27 10 1 1 0 2 7 6 ...
## $ count : num [1:10886] 16 40 32 13 1 1 2 3 8 14 ...
## $ year : Factor w/ 2 levels "2011","2012": 1 1 1 1 1 1 1 1 1 1 ...
## $ month : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ day : Factor w/ 31 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ hour : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ wday_num : Factor w/ 7 levels "Sun","Mon","Tue",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ wday : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 7 7 7 7 7 7 7 7 7 7 ...
## $ log_count : num [1:10886] 2.833 3.714 3.497 2.639 0.693 ...
p1 <- ggplot(train, aes(count)) +
geom_histogram(bins = 40, fill="grey") +
labs(title="Distribution of count")
p2 <- ggplot(train, aes(log_count)) +
geom_histogram(bins = 40, fill="grey") +
labs(title="Distribution of log(count+1)")
gridExtra::grid.arrange(p1, p2, ncol=2)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:patchwork':
##
## area
## The following object is masked from 'package:dplyr':
##
## select
bc <- boxcox(count ~ 1, data = train)
lambda_opt <- bc$x[which.max(bc$y)]
lambda_opt
## [1] 0.3030303
lambda <- 0.3
train$bc_count <- ((train$count + 1)^lambda - 1) / lambda
ggplot(train, aes(bc_count)) +
geom_histogram(bins = 40, fill="grey") +
labs(title = "Box-Cox transformed count (lambda = 0.3)")
I started by converting the categorical variables to factors and decomposing the original timestamp into year, month, day, and hour in order to capture temporal demand patterns at different time scales. Since the response variable count is severely right-skewed, I initially applied a log(count + 1) transformation. To further stabilise variance and approximate normality, I then applied a Box-Cox transformation; the optimal lambda was approximately 0.3. After this transformation, the target distribution became substantially closer to a normal shape.
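Predictions from a model fitted on the Box-Cox scale must be back-transformed before computing count-scale errors. A minimal base-R sketch of the forward and inverse maps (the +1 offset and lambda = 0.3 follow the transformation above):

```r
# Forward Box-Cox map used above: y = ((count + 1)^lambda - 1) / lambda,
# and its inverse: count = (lambda * y + 1)^(1 / lambda) - 1.
lambda <- 0.3
bc_forward <- function(count, lambda) ((count + 1)^lambda - 1) / lambda
bc_inverse <- function(y, lambda) (lambda * y + 1)^(1 / lambda) - 1

counts <- c(1, 42, 145, 284, 977)   # quartiles of count from summary()
y <- bc_forward(counts, lambda)
all.equal(bc_inverse(y, lambda), counts)  # round trip recovers the originals
```

The same inverse is what turns the linear-model predictions later in this report back into rental counts.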
# hourly demand
train %>%
group_by(hour) %>%
summarise(mean_count = mean(count)) %>%
ggplot(aes(x = as.numeric(as.character(hour)), y = mean_count)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = 0:23) +
labs(title = "Average count by hour of day",
x = "hour")
Hourly demand exhibits a strong commuting pattern. The demand increases sharply between 6 and 8 AM, peaks during this period, and declines around 9 AM. The demand rises again from approximately 4 PM, reaching another peak at 5–6 PM before dropping rapidly after 7 PM. Although nighttime demand is relatively low, it is not negligible. Overall, the dominant demand is concentrated around commuting hours, with a noticeable secondary increase around midday.
# working day and hour
train %>%
group_by(workingday, hour) %>%
summarise(mean_count = mean(count), .groups="drop") %>%
ggplot(aes(x = as.numeric(as.character(hour)), y = mean_count, colour=workingday)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks=0:23) +
labs(title="Hourly demand pattern by workingday")
When I separated working days from non-working days, distinct behavioral patterns emerged. On working days, demand is heavily concentrated around morning and evening commuting peaks. On weekends, demand increases steadily from the morning, remains high through the afternoon, and gradually declines in the late afternoon and evening.
# week day and hour
train %>%
group_by(wday, hour) %>%
summarise(mean_count = mean(count), .groups="drop") %>%
ggplot(aes(x = as.numeric(as.character(hour)), y = mean_count, colour=wday)) +
geom_line() +
scale_x_continuous(breaks=0:23) +
labs(title="Hourly demand by day of week")
Day-of-week effects beyond the weekday–weekend distinction appear very weak. Monday through Friday follow nearly identical patterns, as do Saturday and Sunday. This suggests that a binary working-day indicator is sufficient and that detailed weekday dummies add little explanatory value.
# Seasons and count
ggplot(train, aes(season, count, fill=season)) +
geom_boxplot() +
labs(title="Season vs Count")
Seasonal effects are also pronounced. Demand is lowest during winter and highest during summer, reflecting temperature-driven cycling behaviour. Because season and month share overlapping information, including both simultaneously in the model would introduce redundancy and multicollinearity. Therefore, one of them can be safely removed during modelling.
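The season–month redundancy can be made explicit with a contingency table: in this dataset the season code changes with the calendar quarter, so each month maps to exactly one season. A base-R sketch on a synthetic 2011 hourly calendar (the quarter-based coding is an assumption mirroring how season behaves in the training data):

```r
# Build one year of hourly timestamps and derive month and a
# quarter-based season code (1 = Jan-Mar, 2 = Apr-Jun, ...).
hours  <- seq(as.POSIXct("2011-01-01 00:00", tz = "UTC"),
              as.POSIXct("2011-12-31 23:00", tz = "UTC"), by = "hour")
month  <- as.integer(format(hours, "%m"))
season <- (month - 1) %/% 3 + 1

tab <- table(season, month)
# Every month column has a single non-zero row: month fully determines
# season, so including both predictors is redundant.
all(colSums(tab > 0) == 1)
```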
# holiday and working day
p_hol <- ggplot(train, aes(holiday, count, fill=holiday)) +
geom_boxplot() + labs(title="Holiday vs count")
p_work <- ggplot(train, aes(workingday, count, fill=workingday)) +
geom_boxplot() + labs(title="Workingday vs count")
gridExtra::grid.arrange(p_hol, p_work, ncol=2)
Holiday effects show an interesting contrast: overall demand tends to be slightly higher outside holiday periods. The working-day comparison points to commuting-related usage dominating the system. While median demand levels do not differ dramatically, the interquartile range and upper quartiles are clearly higher on working days, indicating that high-demand hours are more frequent when people commute.
# Weather impact
ggplot(train, aes(weather, count, fill=weather)) +
geom_boxplot() +
labs(title="Weather vs count")
As expected, worsening weather conditions lead to a clear reduction in bike demand. Heavier rain or snow is associated with markedly lower demand.
p1 <- ggplot(train, aes(casual)) +
geom_histogram(bins = 40, fill="skyblue") +
labs(title="Distribution of casual users")
p2 <- ggplot(train, aes(registered)) +
geom_histogram(bins = 40, fill="orange") +
labs(title="Distribution of registered users")
p1+p2
train %>%
group_by(hour) %>%
summarise(
casual_mean = mean(casual),
registered_mean = mean(registered)
) %>%
pivot_longer(cols = c(casual_mean, registered_mean),
names_to = "type",
values_to = "mean_val") %>%
ggplot(aes(x = as.numeric(as.character(hour)), y = mean_val, color = type)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = 0:23) +
labs(title = "Hourly usage pattern: casual vs registered",
x = "hour", y = "mean count")
train %>%
group_by(workingday) %>%
summarise(
casual_mean = mean(casual),
registered_mean = mean(registered)
) %>%
pivot_longer(cols = c(casual_mean, registered_mean),
names_to="type",
values_to="mean_val") %>%
ggplot(aes(x=workingday, y=mean_val, fill=type)) +
geom_col(position="dodge") +
labs(title="Casual vs Registered by Workingday")
train %>%
group_by(workingday, hour) %>%
summarise(
casual_mean = mean(casual),
registered_mean = mean(registered),
.groups="drop"
) %>%
pivot_longer(cols = c(casual_mean, registered_mean),
names_to = "type",
values_to = "mean_val") %>%
ggplot(aes(
x = as.numeric(as.character(hour)),
y = mean_val,
color = type
)) +
  geom_line(linewidth = 1) +
  facet_wrap(~ workingday, ncol = 1,
             labeller = labeller(workingday = c("No" = "Weekend", "Yes" = "Weekday"))) +
  scale_x_continuous(breaks = 0:23) +
  labs(
    title = "Hourly pattern for casual vs registered on weekend vs weekday",
    x = "Hour",
    y = "Mean count"
  )
The majority of bike usage comes from registered users, whose behaviour largely determines the overall demand structure. Registered users primarily ride for commuting purposes, which explains the sharp weekday peaks. Even on weekends, registered users account for more trips than casual users, suggesting that people who regularly commute by bike also ride recreationally. Casual users, in contrast, are almost absent on weekdays and mainly appear on weekends. I verified that total demand (count) is exactly the sum of casual and registered, so including these variables in a predictive model would let it reconstruct the target directly rather than learn meaningful relationships. Both variables were therefore excluded from all models.
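The identity check is a one-liner. Shown here on a toy data frame mirroring the first rows of train; on the real data the equivalent test is `with(train, all(count == casual + registered))`:

```r
# count should equal casual + registered exactly, row by row.
toy <- data.frame(casual     = c(3, 8, 5),
                  registered = c(13, 32, 27),
                  count      = c(16, 40, 32))
with(toy, all(count == casual + registered))  # TRUE
```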
#numerical variables
p_temp <- ggplot(train, aes(temp)) + geom_histogram(bins=40)
p_atemp <- ggplot(train, aes(atemp)) + geom_histogram(bins=40)
p_hum <- ggplot(train, aes(humidity)) + geom_histogram(bins=40)
p_wind <- ggplot(train, aes(windspeed)) + geom_histogram(bins=40)
gridExtra::grid.arrange(p_temp, p_atemp, p_hum, p_wind, ncol=2)
# corr matrix
train %>%
dplyr::select(temp, atemp, humidity, windspeed, count) %>%
GGally::ggcorr(label = TRUE)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
Temperature (temp) is the actual air temperature, while atemp is the perceived, feels-like temperature. Both variables show clear seasonality and several discrete peaks, likely due to rounded measurements. Their distributions are approximately normal and align well with real-world cycling behaviour: demand is highest at moderate temperatures and decreases in extreme cold or heat. Because cycling exposes users directly to outdoor conditions, temperature is one of the strongest determinants of demand. As expected, temp and atemp are almost perfectly correlated, so either one can be used.
Humidity displays a right-skewed distribution, and higher humidity generally suppresses demand, likely due to discomfort and its association with poor weather.
Windspeed shows an unusually large number of zero values. In reality, sustained zero windspeed is extremely rare, and the fact that zero is the modal value strongly suggests that unmeasured or missing observations were recorded as zero.
For this reason, I will treat windspeed values equal to zero as missing and impute them using KNN imputation.
train_imp <- train %>%
mutate(windspeed = ifelse(windspeed == 0, NA, windspeed))
imp_data <- train_imp %>%
mutate(
season_num = as.numeric(season),
weather_num = as.numeric(weather),
year_num = as.numeric(year),
month_num = as.numeric(month),
hour_num = as.numeric(hour)
) %>%
dplyr::select(
windspeed, temp, atemp, humidity,
season_num, weather_num, year_num, month_num, hour_num
)
imp_result <- kNN(
imp_data,
variable = "windspeed",
k = 5,
imp_var = FALSE
)
## temp atemp humidity season_num weather_num year_num
## 0.820 0.760 0.000 1.000 1.000 1.000
## month_num hour_num temp atemp humidity season_num
## 1.000 1.000 41.000 45.455 100.000 4.000
## weather_num year_num month_num hour_num
## 4.000 2.000 12.000 24.000
train_imp$windspeed <- imp_result$windspeed
p_wind <- ggplot(train, aes(windspeed)) + geom_histogram(bins=500, fill="red")+
labs(title="Before KNN Imputation (0 → NA)")
p_new <- ggplot(train_imp, aes(windspeed)) +
geom_histogram(bins = 500, fill="blue") +
labs(title="After KNN Imputation")
gridExtra::grid.arrange(p_wind, p_new, ncol=2)
After imputation, the artificial spike at zero disappears, producing a more realistic windspeed distribution and preventing distortion in downstream models. This correction matters especially for linear and Poisson regression, where a mass of artificial zeros can bias coefficient estimates, and for Random Forest models, where such spikes degrade split quality.
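The coefficient bias can be illustrated with a small, purely synthetic simulation (none of these numbers come from the bike data; the -3 slope and 15% missingness are illustrative assumptions):

```r
# True negative windspeed effect, but missing windspeed recorded as 0:
# the contaminated points sit at x = 0 with y values typical of average
# wind, flattening the fitted line and attenuating the slope toward zero.
set.seed(1)
n        <- 5000
wind     <- rgamma(n, shape = 4, rate = 0.3)     # "true" windspeed
y        <- 200 - 3 * wind + rnorm(n, sd = 20)   # demand with a -3 slope
wind_rec <- ifelse(runif(n) < 0.15, 0, wind)     # 15% recorded as zero

coef(lm(y ~ wind))["wind"]          # close to the true -3
coef(lm(y ~ wind_rec))["wind_rec"]  # noticeably attenuated toward zero
```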
year_summary <- train_imp %>%
group_by(year) %>%
summarise(
mean_count = mean(count),
sd_count = sd(count),
n = n(),
se_count = sd_count / sqrt(n)
)
p1<- ggplot(year_summary, aes(x = factor(year), y = mean_count)) +
geom_col() +
geom_errorbar(aes(ymin = mean_count - se_count,
ymax = mean_count + se_count),
width = 0.2) +
labs(title = "Average count by year", x = "year", y = "count")
month_summary <- train_imp %>%
group_by(month) %>%
summarise(
mean_count = mean(count),
sd_count = sd(count),
n = n(),
se_count = sd_count / sqrt(n)
)
p2<- ggplot(month_summary, aes(x = factor(month), y = mean_count)) +
geom_col() +
geom_errorbar(aes(ymin = mean_count - se_count,
ymax = mean_count + se_count),
width = 0.2) +
labs(title = "Average count by month", x = "month", y = "count")
year_month_summary <- train_imp %>%
mutate(
year_num = as.integer(as.character(year)),
month_num = as.integer(month),
year_month = sprintf("%d-%02d", year_num, month_num)
) %>%
group_by(year_month) %>%
summarise(
mean_count = mean(count),
sd_count = sd(count),
n = n(),
se_count = sd_count / sqrt(n),
.groups = "drop"
)
p3<- ggplot(
year_month_summary,
aes(
x = factor(year_month, levels = sort(unique(year_month))),
y = mean_count
)
) +
geom_col() +
geom_errorbar(
aes(
ymin = mean_count - se_count,
ymax = mean_count + se_count
),
width = 0.2
) +
labs(
title = "Average count by year-month",
x = "year_month",
y = "count"
) +
theme(axis.text.x = element_text(angle = 60, hjust = 1))
(p1+p2)/p3
Finally, examining demand by year and month revealed a clear increase in usage from 2011 to 2012, indicating system growth. Monthly patterns show strong seasonality.
day_summary <- train_imp %>%
group_by(day) %>%
summarise(
mean_count = mean(count),
sd_count = sd(count),
n = n(),
se_count = sd_count / sqrt(n),
.groups = "drop"
)
p4 <- ggplot(day_summary, aes(x = day, y = mean_count)) +
geom_col(fill = "grey40") +
geom_errorbar(
aes(ymin = mean_count - se_count,
ymax = mean_count + se_count),
width = 0.2
) +
labs(
title = "Average count by day of month (1–31)",
x = "day",
y = "count"
) +
theme_bw()
p4
The day of the month does not meaningfully explain demand, so this variable was excluded from modelling.
set.seed(123)
split <- initial_split(train_imp, prop = 0.75)
train_set <- training(split)
test_set <- testing(split)
add_hour_bin <- function(df) {
  df %>%
    mutate(
      hour_bin = case_when(
        hour %in% c(0,1,2,3,4,5) ~ "late_night",
        hour %in% c(6,7,8,9) ~ "morning_peak",
        hour %in% c(10,11,12,13) ~ "midday",
        hour %in% c(14,15,16) ~ "afternoon",
        hour %in% c(17,18,19) ~ "evening_peak",
        TRUE ~ "night"
      ),
      hour_bin = factor(hour_bin,
                        levels = c("late_night","morning_peak","midday",
                                   "afternoon","evening_peak","night"))
    )
}
train_set <- add_hour_bin(train_set)
test_set <- add_hour_bin(test_set)
# Baseline Linear Regression with lambda 0.3
bike_rec <- recipe(bc_count ~ ., data = train_set) %>%
step_rm(datetime, casual, registered, count, log_count, season,
day, hour, wday, wday_num) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors())
bike_prep <- prep(bike_rec)
train_baked <- bake(bike_prep, new_data = NULL)
test_baked <- bake(bike_prep, new_data = test_set)
lm_bc <- lm(bc_count ~ ., data = train_baked)
summary(lm_bc)
##
## Call:
## lm(formula = bc_count ~ ., data = train_baked)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.9375 -1.6874 -0.1061 1.7547 9.0785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.95686 0.02913 376.144 < 2e-16 ***
## temp 0.81772 0.19603 4.171 3.06e-05 ***
## atemp 0.53762 0.17746 3.030 0.002457 **
## humidity -0.55380 0.04001 -13.841 < 2e-16 ***
## windspeed -0.11122 0.03280 -3.391 0.000699 ***
## holiday_Yes -0.05062 0.03052 -1.658 0.097256 .
## workingday_Yes -0.04898 0.03015 -1.625 0.104253
## weather_Mist -0.05955 0.03159 -1.885 0.059485 .
## weather_Light.Snow.Rain -0.48808 0.03260 -14.970 < 2e-16 ***
## year_X2012 0.93090 0.02967 31.374 < 2e-16 ***
## month_Feb 0.13091 0.04033 3.246 0.001175 **
## month_Mar 0.16197 0.04293 3.773 0.000162 ***
## month_Apr 0.32553 0.04531 7.185 7.30e-13 ***
## month_May 0.53893 0.05195 10.375 < 2e-16 ***
## month_Jun 0.36724 0.05923 6.200 5.92e-10 ***
## month_Jul 0.14072 0.06519 2.159 0.030898 *
## month_Aug 0.24098 0.06401 3.765 0.000168 ***
## month_Sep 0.46658 0.05718 8.160 3.85e-16 ***
## month_Oct 0.66251 0.04904 13.510 < 2e-16 ***
## month_Nov 0.65037 0.04202 15.479 < 2e-16 ***
## month_Dec 0.65642 0.04187 15.676 < 2e-16 ***
## hour_bin_morning_peak 2.83957 0.03453 82.237 < 2e-16 ***
## hour_bin_midday 2.79471 0.03808 73.392 < 2e-16 ***
## hour_bin_afternoon 2.63601 0.03912 67.380 < 2e-16 ***
## hour_bin_evening_peak 3.51585 0.03724 94.401 < 2e-16 ***
## hour_bin_night 2.31663 0.03541 65.424 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.632 on 8138 degrees of freedom
## Multiple R-squared: 0.7389, Adjusted R-squared: 0.7381
## F-statistic: 921.1 on 25 and 8138 DF, p-value: < 2.2e-16
lm_step <- stats::step(lm_bc, direction = "both")
## Start: AIC=15827.17
## bc_count ~ temp + atemp + humidity + windspeed + holiday_Yes +
## workingday_Yes + weather_Mist + weather_Light.Snow.Rain +
## year_X2012 + month_Feb + month_Mar + month_Apr + month_May +
## month_Jun + month_Jul + month_Aug + month_Sep + month_Oct +
## month_Nov + month_Dec + hour_bin_morning_peak + hour_bin_midday +
## hour_bin_afternoon + hour_bin_evening_peak + hour_bin_night
##
## Df Sum of Sq RSS AIC
## <none> 56375 15827
## - workingday_Yes 1 18 56393 15828
## - holiday_Yes 1 19 56394 15828
## - weather_Mist 1 25 56399 15829
## - month_Jul 1 32 56407 15830
## - atemp 1 64 56438 15834
## - month_Feb 1 73 56448 15836
## - windspeed 1 80 56454 15837
## - month_Aug 1 98 56473 15839
## - month_Mar 1 99 56473 15839
## - temp 1 121 56495 15843
## - month_Jun 1 266 56641 15864
## - month_Apr 1 358 56732 15877
## - month_Sep 1 461 56836 15892
## - month_May 1 746 57120 15932
## - month_Oct 1 1264 57639 16006
## - humidity 1 1327 57702 16015
## - weather_Light.Snow.Rain 1 1552 57927 16047
## - month_Nov 1 1660 58034 16062
## - month_Dec 1 1702 58077 16068
## - year_X2012 1 6819 63194 16757
## - hour_bin_night 1 29651 86025 19276
## - hour_bin_afternoon 1 31450 87825 19444
## - hour_bin_midday 1 37313 93687 19972
## - hour_bin_morning_peak 1 46849 103223 20763
## - hour_bin_evening_peak 1 61734 118108 21863
summary(lm_step)
##
## Call:
## lm(formula = bc_count ~ temp + atemp + humidity + windspeed +
## holiday_Yes + workingday_Yes + weather_Mist + weather_Light.Snow.Rain +
## year_X2012 + month_Feb + month_Mar + month_Apr + month_May +
## month_Jun + month_Jul + month_Aug + month_Sep + month_Oct +
## month_Nov + month_Dec + hour_bin_morning_peak + hour_bin_midday +
## hour_bin_afternoon + hour_bin_evening_peak + hour_bin_night,
## data = train_baked)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.9375 -1.6874 -0.1061 1.7547 9.0785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.95686 0.02913 376.144 < 2e-16 ***
## temp 0.81772 0.19603 4.171 3.06e-05 ***
## atemp 0.53762 0.17746 3.030 0.002457 **
## humidity -0.55380 0.04001 -13.841 < 2e-16 ***
## windspeed -0.11122 0.03280 -3.391 0.000699 ***
## holiday_Yes -0.05062 0.03052 -1.658 0.097256 .
## workingday_Yes -0.04898 0.03015 -1.625 0.104253
## weather_Mist -0.05955 0.03159 -1.885 0.059485 .
## weather_Light.Snow.Rain -0.48808 0.03260 -14.970 < 2e-16 ***
## year_X2012 0.93090 0.02967 31.374 < 2e-16 ***
## month_Feb 0.13091 0.04033 3.246 0.001175 **
## month_Mar 0.16197 0.04293 3.773 0.000162 ***
## month_Apr 0.32553 0.04531 7.185 7.30e-13 ***
## month_May 0.53893 0.05195 10.375 < 2e-16 ***
## month_Jun 0.36724 0.05923 6.200 5.92e-10 ***
## month_Jul 0.14072 0.06519 2.159 0.030898 *
## month_Aug 0.24098 0.06401 3.765 0.000168 ***
## month_Sep 0.46658 0.05718 8.160 3.85e-16 ***
## month_Oct 0.66251 0.04904 13.510 < 2e-16 ***
## month_Nov 0.65037 0.04202 15.479 < 2e-16 ***
## month_Dec 0.65642 0.04187 15.676 < 2e-16 ***
## hour_bin_morning_peak 2.83957 0.03453 82.237 < 2e-16 ***
## hour_bin_midday 2.79471 0.03808 73.392 < 2e-16 ***
## hour_bin_afternoon 2.63601 0.03912 67.380 < 2e-16 ***
## hour_bin_evening_peak 3.51585 0.03724 94.401 < 2e-16 ***
## hour_bin_night 2.31663 0.03541 65.424 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.632 on 8138 degrees of freedom
## Multiple R-squared: 0.7389, Adjusted R-squared: 0.7381
## F-statistic: 921.1 on 25 and 8138 DF, p-value: < 2.2e-16
plot(lm_step)
vif(lm_step)
## temp atemp humidity
## 45.283717 37.108725 1.886645
## windspeed holiday_Yes workingday_Yes
## 1.267428 1.097731 1.070974
## weather_Mist weather_Light.Snow.Rain year_X2012
## 1.176206 1.252593 1.037390
## month_Feb month_Mar month_Apr
## 1.916405 2.171410 2.418693
## month_May month_Jun month_Jul
## 3.179689 4.134218 5.007400
## month_Aug month_Sep month_Oct
## 4.828409 3.852621 2.833805
## month_Nov month_Dec hour_bin_morning_peak
## 2.080328 2.066147 1.404942
## hour_bin_midday hour_bin_afternoon hour_bin_evening_peak
## 1.708697 1.803512 1.634517
## hour_bin_night
## 1.477497
The VIFs for temp and atemp are far above the common threshold of 10, indicating severe collinearity between the two temperature measures, so I refit the model without atemp.
lm_no_atemp <- lm(
bc_count ~ temp + humidity + windspeed +
holiday_Yes + workingday_Yes + weather_Mist +
year_X2012 + month_Feb + month_Mar + month_Apr + month_May +
month_Jun + month_Jul + month_Aug + month_Sep + month_Oct +
month_Nov + month_Dec + hour_bin_morning_peak + hour_bin_midday +
hour_bin_afternoon + hour_bin_evening_peak + hour_bin_night,
data = train_baked
)
lm_step_no_atemp <- stats::step(lm_no_atemp, direction = "both")
## Start: AIC=16060.1
## bc_count ~ temp + humidity + windspeed + holiday_Yes + workingday_Yes +
## weather_Mist + year_X2012 + month_Feb + month_Mar + month_Apr +
## month_May + month_Jun + month_Jul + month_Aug + month_Sep +
## month_Oct + month_Nov + month_Dec + hour_bin_morning_peak +
## hour_bin_midday + hour_bin_afternoon + hour_bin_evening_peak +
## hour_bin_night
##
## Df Sum of Sq RSS AIC
## <none> 58035 16060
## - holiday_Yes 1 17 58052 16060
## - month_Jul 1 25 58060 16062
## - workingday_Yes 1 38 58073 16064
## - weather_Mist 1 39 58074 16064
## - month_Feb 1 57 58091 16066
## - month_Aug 1 83 58117 16070
## - month_Mar 1 95 58130 16072
## - month_Jun 1 248 58282 16093
## - windspeed 1 254 58288 16094
## - month_Apr 1 328 58363 16104
## - month_Sep 1 458 58492 16122
## - month_May 1 747 58782 16163
## - month_Oct 1 1269 59304 16235
## - month_Nov 1 1734 59769 16298
## - month_Dec 1 1856 59891 16315
## - temp 1 2875 60909 16453
## - humidity 1 3213 61248 16498
## - year_X2012 1 6684 64719 16948
## - hour_bin_night 1 28866 86900 19354
## - hour_bin_afternoon 1 30046 88081 19464
## - hour_bin_midday 1 36051 94086 20003
## - hour_bin_morning_peak 1 46446 104480 20858
## - hour_bin_evening_peak 1 60147 118182 21864
summary(lm_step_no_atemp)
##
## Call:
## lm(formula = bc_count ~ temp + humidity + windspeed + holiday_Yes +
## workingday_Yes + weather_Mist + year_X2012 + month_Feb +
## month_Mar + month_Apr + month_May + month_Jun + month_Jul +
## month_Aug + month_Sep + month_Oct + month_Nov + month_Dec +
## hour_bin_morning_peak + hour_bin_midday + hour_bin_afternoon +
## hour_bin_evening_peak + hour_bin_night, data = train_baked)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6796 -1.6539 -0.0954 1.7717 8.8316
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.95686 0.02955 370.771 < 2e-16 ***
## temp 1.39497 0.06947 20.080 < 2e-16 ***
## humidity -0.78758 0.03710 -21.229 < 2e-16 ***
## windspeed -0.19347 0.03243 -5.967 2.53e-09 ***
## holiday_Yes -0.04789 0.03092 -1.549 0.121423
## workingday_Yes -0.07060 0.03053 -2.312 0.020780 *
## weather_Mist 0.07218 0.03075 2.347 0.018949 *
## year_X2012 0.92120 0.03009 30.618 < 2e-16 ***
## month_Feb 0.11517 0.04088 2.817 0.004855 **
## month_Mar 0.15895 0.04354 3.650 0.000263 ***
## month_Apr 0.31144 0.04592 6.782 1.27e-11 ***
## month_May 0.53951 0.05270 10.238 < 2e-16 ***
## month_Jun 0.35339 0.05996 5.894 3.93e-09 ***
## month_Jul 0.12429 0.06586 1.887 0.059184 .
## month_Aug 0.21898 0.06427 3.407 0.000660 ***
## month_Sep 0.46358 0.05787 8.011 1.29e-15 ***
## month_Oct 0.66377 0.04975 13.342 < 2e-16 ***
## month_Nov 0.66465 0.04261 15.597 < 2e-16 ***
## month_Dec 0.68440 0.04241 16.136 < 2e-16 ***
## hour_bin_morning_peak 2.82614 0.03501 80.713 < 2e-16 ***
## hour_bin_midday 2.72884 0.03838 71.110 < 2e-16 ***
## hour_bin_afternoon 2.54924 0.03927 64.918 < 2e-16 ***
## hour_bin_evening_peak 3.43858 0.03744 91.849 < 2e-16 ***
## hour_bin_night 2.28008 0.03583 63.630 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.67 on 8140 degrees of freedom
## Multiple R-squared: 0.7312, Adjusted R-squared: 0.7304
## F-statistic: 962.7 on 23 and 8140 DF, p-value: < 2.2e-16
plot(lm_step_no_atemp)
# Inverse of the shifted Box-Cox transform y_bc = ((y + 1)^lambda - 1) / lambda,
# mapping predictions back to the original count scale
inv_boxcox <- function(y_bc, lambda) {
((y_bc * lambda) + 1)^(1 / lambda) - 1
}
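As a quick sanity check, the inverse should exactly undo the forward transform. A standalone round-trip sketch (both functions redefined here for self-containment; `lambda = 0.3` is illustrative, not the value fitted to the training data):

```r
# Assumed forward transform with the same +1 shift as inv_boxcox
boxcox_fwd <- function(y, lambda) (((y + 1)^lambda) - 1) / lambda
inv_boxcox <- function(y_bc, lambda) ((y_bc * lambda) + 1)^(1 / lambda) - 1

y      <- c(1, 16, 145, 977)   # example hourly counts
lambda <- 0.3                  # illustrative lambda only
y_back <- inv_boxcox(boxcox_fwd(y, lambda), lambda)
all.equal(y, y_back)           # TRUE: the inverse undoes the forward transform
```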
pred_bc <- predict(lm_step_no_atemp, newdata = test_baked)
pred_cnt <- inv_boxcox(pred_bc, lambda)
rmsle <- function(pred, actual) {
sqrt(mean((log(pred + 1) - log(actual + 1))^2))
}
rmsle_lm_bc <- rmsle(pred_cnt, test_set$count)
rmsle_lm_bc
## [1] 0.728459
The first model fitted was a linear regression on the Box–Cox-transformed response (bc_count). To reduce multicollinearity and avoid an excessive number of dummy variables, I grouped the 24 hours of the day into six broader time bins representing distinct behavioural periods.
The baseline linear model achieved an adjusted R^2 of approximately 0.73. However, diagnostic plots revealed clear heteroscedasticity: as fitted values increased, so did the residual variance, violating the constant-variance assumption. Evaluated with RMSLE, the competition metric for this Kaggle task, the linear model scored 0.728.
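The heteroscedasticity seen in the diagnostic plots can also be checked formally. A hedged sketch of a Breusch-Pagan-style test in base R, demonstrated on simulated data with variance growing in x (on the real model, the same auxiliary regression would use the residuals and fitted values of lm_step_no_atemp):

```r
set.seed(1)
x   <- runif(500)
y   <- 2 + 3 * x + rnorm(500, sd = 0.5 + 2 * x)  # variance increases with x
fit <- lm(y ~ x)

# Regress squared residuals on fitted values; n * R^2 is approximately
# chi-square with 1 df under the null of constant variance
aux  <- lm(residuals(fit)^2 ~ fitted(fit))
stat <- length(y) * summary(aux)$r.squared
pval <- pchisq(stat, df = 1, lower.tail = FALSE)
pval  # small p-value -> reject constant variance
```

With the car package already loaded, `ncvTest(lm_step_no_atemp)` performs an equivalent score test directly.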
# poisson model
# prep
rec_cnt <- recipe(count ~ ., data = train_set) %>%
step_rm(
datetime,
casual, registered,
log_count, bc_count,
season, day, hour,
wday, wday_num
) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors())
prep_cnt <- prep(rec_cnt)
train_baked_cnt <- bake(prep_cnt, new_data = NULL)
test_baked_cnt <- bake(prep_cnt, new_data = test_set)
# modelling
pois_mod <- glm(
count ~ .,
data = train_baked_cnt,
family = poisson(link = "log")
)
summary(pois_mod)
##
## Call:
## glm(formula = count ~ ., family = poisson(link = "log"), data = train_baked_cnt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.8112046 0.0013249 3631.306 <2e-16 ***
## temp 0.1179154 0.0050568 23.318 <2e-16 ***
## atemp 0.1348716 0.0046365 29.089 <2e-16 ***
## humidity -0.0949470 0.0011546 -82.234 <2e-16 ***
## windspeed -0.0090818 0.0008822 -10.294 <2e-16 ***
## holiday_Yes -0.0099195 0.0008754 -11.332 <2e-16 ***
## workingday_Yes 0.0069077 0.0008292 8.331 <2e-16 ***
## weather_Mist -0.0161882 0.0008890 -18.209 <2e-16 ***
## weather_Light.Snow.Rain -0.1114690 0.0010938 -101.906 <2e-16 ***
## year_X2012 0.2219856 0.0008352 265.802 <2e-16 ***
## month_Feb 0.0375297 0.0015396 24.377 <2e-16 ***
## month_Mar 0.0723670 0.0015302 47.292 <2e-16 ***
## month_Apr 0.1055305 0.0015343 68.783 <2e-16 ***
## month_May 0.1457614 0.0016557 88.035 <2e-16 ***
## month_Jun 0.1218011 0.0018115 67.238 <2e-16 ***
## month_Jul 0.0663951 0.0019722 33.665 <2e-16 ***
## month_Aug 0.0955963 0.0019241 49.683 <2e-16 ***
## month_Sep 0.1401174 0.0017661 79.335 <2e-16 ***
## month_Oct 0.1789193 0.0015671 114.172 <2e-16 ***
## month_Nov 0.1664481 0.0014164 117.517 <2e-16 ***
## month_Dec 0.1694252 0.0014355 118.029 <2e-16 ***
## hour_bin_morning_peak 0.8168870 0.0018047 452.643 <2e-16 ***
## hour_bin_midday 0.7525513 0.0018640 403.723 <2e-16 ***
## hour_bin_afternoon 0.6950053 0.0017044 407.772 <2e-16 ***
## hour_bin_evening_peak 0.8603717 0.0016529 520.513 <2e-16 ***
## hour_bin_night 0.6526048 0.0018674 349.481 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1340258 on 8163 degrees of freedom
## Residual deviance: 382995 on 8138 degrees of freedom
## AIC: 435224
##
## Number of Fisher Scoring iterations: 5
overdisp <- sum(residuals(pois_mod, type = "pearson")^2) / pois_mod$df.residual
overdisp
## [1] 49.11274
library(MASS)  # provides glm.nb (note: MASS masks dplyr::select)
nb_mod <- glm.nb(
count ~ .,
data = train_baked_cnt
)
summary(nb_mod)
##
## Call:
## glm.nb(formula = count ~ ., data = train_baked_cnt, init.theta = 2.429944107,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.795439 0.007244 661.986 < 2e-16 ***
## temp 0.208799 0.048528 4.303 1.69e-05 ***
## atemp 0.097638 0.043937 2.222 0.026268 *
## humidity -0.115533 0.009949 -11.612 < 2e-16 ***
## windspeed -0.027162 0.008134 -3.339 0.000840 ***
## holiday_Yes -0.031503 0.007582 -4.155 3.25e-05 ***
## workingday_Yes -0.114657 0.007477 -15.335 < 2e-16 ***
## weather_Mist -0.010592 0.007848 -1.350 0.177096
## weather_Light.Snow.Rain -0.113198 0.008168 -13.858 < 2e-16 ***
## year_X2012 0.224714 0.007368 30.500 < 2e-16 ***
## month_Feb 0.034124 0.010143 3.364 0.000767 ***
## month_Mar 0.047461 0.010765 4.409 1.04e-05 ***
## month_Apr 0.081601 0.011336 7.198 6.10e-13 ***
## month_May 0.141130 0.012957 10.892 < 2e-16 ***
## month_Jun 0.102164 0.014752 6.925 4.35e-12 ***
## month_Jul 0.064226 0.016227 3.958 7.56e-05 ***
## month_Aug 0.073277 0.015932 4.599 4.24e-06 ***
## month_Sep 0.126700 0.014242 8.896 < 2e-16 ***
## month_Oct 0.166584 0.012227 13.624 < 2e-16 ***
## month_Nov 0.167215 0.010493 15.935 < 2e-16 ***
## month_Dec 0.174362 0.010459 16.671 < 2e-16 ***
## hour_bin_morning_peak 0.862512 0.008657 99.628 < 2e-16 ***
## hour_bin_midday 0.753546 0.009529 79.081 < 2e-16 ***
## hour_bin_afternoon 0.696042 0.009747 71.411 < 2e-16 ***
## hour_bin_evening_peak 0.874560 0.009278 94.257 < 2e-16 ***
## hour_bin_night 0.664410 0.008885 74.781 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2.4299) family taken to be 1)
##
## Null deviance: 26393.6 on 8163 degrees of freedom
## Residual deviance: 8886.1 on 8138 degrees of freedom
## AIC: 92181
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2.4299
## Std. Err.: 0.0387
##
## 2 x log-likelihood: -92127.4900
Next, a Poisson regression model was fitted on the original count scale. The Pearson dispersion statistic was approximately 49, indicating that the variance far exceeded the mean and that the Poisson equidispersion assumption was severely violated. A Negative Binomial (NB) model was therefore fitted instead, yielding an estimated dispersion parameter theta ≈ 2.43.
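The dispersion check used above is easy to reproduce in isolation. A hedged sketch on simulated negative-binomial counts (size = 2 gives variance well above the mean), mirroring the Pearson-residual computation applied to pois_mod:

```r
set.seed(42)
mu  <- exp(1 + rnorm(2000, sd = 0.5))   # simulated expected counts
cnt <- rnbinom(2000, size = 2, mu = mu) # overdispersed: Var = mu + mu^2/2

pm   <- glm(cnt ~ log(mu), family = poisson(link = "log"))
disp <- sum(residuals(pm, type = "pearson")^2) / pm$df.residual
disp  # well above 1, flagging overdispersion
```

A value near 1 is consistent with the Poisson assumption; values far above 1, as here and in the real model, point to a Negative Binomial or quasi-Poisson fit instead.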
plot(nb_mod)
overdisp_nb <- sum(residuals(nb_mod, type="pearson")^2) / nb_mod$df.residual
overdisp_nb
## [1] 1.03498
pred_nb <- predict(nb_mod, newdata = test_baked_cnt, type = "response")
rmsle_nb <- rmsle(pred_nb, test_set$count)
rmsle_nb
## [1] 0.7502783
The NB model captured the overall structure of the data reasonably well, identifying the correct directional effects of weather, seasonality, and time. However, it continued to struggle with extreme high-demand periods, showing a heavy right-tail mismatch and mild residual heteroscedasticity. Although the NB model is a solid statistical baseline, its RMSLE of approximately 0.75 was worse than the linear model's. This gap is likely due to the very high variance in peak-demand hours: even with an extra dispersion parameter, the NB model is constrained by its parametric form and cannot flexibly adapt to sharp demand spikes.
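The peak-hour weakness can be quantified by segmenting the error metric. A hedged sketch comparing RMSLE overall versus on the top demand decile, using simulated predictions that systematically under-shoot large counts (a cap at 250 mimics the NB model's missed spikes; all values are synthetic):

```r
rmsle <- function(pred, actual) sqrt(mean((log1p(pred) - log1p(actual))^2))

set.seed(99)
actual <- rnbinom(2000, size = 2, mu = 150)               # simulated demand
pred   <- pmin(actual * exp(rnorm(2000, sd = 0.3)), 250)  # cap = missed spikes

peak   <- actual >= quantile(actual, 0.9)                 # top demand decile
scores <- c(overall    = rmsle(pred, actual),
            peak_hours = rmsle(pred[peak], actual[peak]))
scores  # peak-hour RMSLE exceeds the overall score
```

Applying the same split to pred_nb and test_set$count would localise where the NB model loses to the linear baseline.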
rf_rec <- recipe(count ~ temp + atemp + humidity + windspeed +
holiday + workingday + weather +
year + month + hour,
data = train_imp) %>%
step_dummy(all_nominal_predictors())
prep_rf <- prep(rf_rec)
train_baked_rf <- bake(prep_rf, new_data = NULL)
test_baked_rf <- bake(prep_rf, new_data = test_set)
# random forest
library(randomForest)
set.seed(123)
rf_basic <- randomForest(
count ~ .,
data = train_baked_rf,
ntree = 500,
mtry = 3,
importance = TRUE
)
rf_basic
##
## Call:
## randomForest(formula = count ~ ., data = train_baked_rf, ntree = 500, mtry = 3, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 8933.908
## % Var explained: 72.77
pred_basic <- predict(rf_basic, newdata = test_baked_rf)
rmsle_rf_basic <- rmsle(pred_basic, test_set$count)
rmsle_rf_basic
## [1] 0.9401285
#tuning
K <- 5
fold_id <- sample(rep(1:K, length.out = nrow(train_baked_rf)))
param_grid <- expand.grid(
mtry = c(3, 5, 7, 9),
nodesize = c(5, 15, 30),
ntree = c(300, 500, 800)
)
results <- param_grid
results$cv_rmsle <- NA_real_
# NOTE: the 5-fold CV grid search below is slow, so it is left commented out
# after an offline run; its best configuration is hard-coded in rf_best below.
#for (i in seq_len(nrow(param_grid))) {
# p <- param_grid[i, ]
# cv_scores <- numeric(K)
#
# for (k in 1:K) {
# train_fold <- train_baked_rf[fold_id != k, ]
# valid_fold <- train_baked_rf[fold_id == k, ]
#
# fit <- randomForest(
# count ~ .,
# data = train_fold,
# ntree = p$ntree,
# mtry = p$mtry,
# nodesize = p$nodesize
# )
# pred <- predict(fit, newdata = valid_fold)
# cv_scores[k] <- rmsle(pred, valid_fold$count)
# }
# results$cv_rmsle[i] <- mean(cv_scores)
# cat("Done:", i, "of", nrow(param_grid),
# "| mtry =", p$mtry,
# "nodesize =", p$nodesize,
# "ntree =", p$ntree,
# "| CV RMSLE =", round(results$cv_rmsle[i], 4), "\n")
#}
#results <- results[order(results$cv_rmsle), ]
#head(results, 5)
best <- results[1, ]
best
## mtry nodesize ntree cv_rmsle
## 1 3 5 300 NA
rf_best <- randomForest(
count ~ .,
data = train_baked_rf,
ntree = 500,#best$ntree,
mtry = 9,#best$mtry,
nodesize = 5,#best$nodesize,
importance = TRUE
)
pred_test <- predict(rf_best, newdata = test_baked_rf)
rmsle_rf_best <- rmsle(pred_test, test_set$count)
rmsle_rf_best
## [1] 0.4310566
Finally, a Random Forest regression model was fitted to capture nonlinear relationships and complex interactions. The untuned baseline achieved an R^2 of approximately 0.73 but an RMSLE of 0.94, indicating that the default hyperparameters were suboptimal. After a systematic grid search, the best configuration (mtry = 9, nodesize = 5, ntree = 500) achieved an RMSLE of 0.43. This substantial improvement confirms that demand is driven by highly nonlinear interactions between hour, month, year, and weather conditions, which ensemble tree methods capture far better than parametric models.
varImpPlot(rf_best)
Variable importance was evaluated using the mean decrease in prediction error across trees (permutation importance), reflecting how much each variable contributes to reducing overall forecasting error.
The most influential predictor by a large margin is hour of day. This confirms that hourly demand is fundamentally structured by daily human activity cycles. In particular, commuting behavior during morning and evening peak hours dominates overall usage patterns. The model relies heavily on this variable to separate low-demand nighttime periods from high-demand commuting periods, making hour the single most critical feature.
The year variable also shows high importance, since demand grew substantially from 2011 to 2012.
Working-day status, month, temperature, humidity, and windspeed are the next most important factors.
Overall, the variable importance results reinforce a clear hierarchy. Time-driven behavioral patterns dominate bike-sharing demand, while weather conditions act as modifiers on top of this temporal backbone. This explains why Random Forest, which naturally captures interactions between time and weather, substantially outperforms linear and count-based models.
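Beyond importance rankings, partial dependence makes the dominant time-of-day effect concrete: pin one feature at each grid value and average the model's predictions over the data. A base-R sketch of the idea on a toy lm with simulated data (the same loop works with rf_best via `predict(rf_best, ...)`, and randomForest::partialPlot automates it):

```r
set.seed(7)
d <- data.frame(hour = sample(0:23, 1000, replace = TRUE),
                temp = runif(1000, 0, 1))
# synthetic demand peaking in the afternoon, rising with temperature
d$count <- 50 + 30 * sin((d$hour - 8) * pi / 12) + 40 * d$temp +
  rnorm(1000, sd = 5)
fit <- lm(count ~ sin((hour - 8) * pi / 12) + temp, data = d)

pd <- sapply(0:23, function(h) {
  d_h      <- d
  d_h$hour <- h                      # pin hour, keep other features observed
  mean(predict(fit, newdata = d_h))  # average prediction = partial dependence
})
plot(0:23, pd, type = "b", xlab = "hour", ylab = "partial dependence")
```

The resulting curve isolates the hour effect net of the other predictors, which is exactly the shape the hour bins in the parametric models try to approximate.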
In this project, I examined hourly bike-sharing demand by combining detailed exploratory analysis with a range of statistical models: linear, Poisson, Negative Binomial, and an ensemble model, Random Forest. The analysis confirms that demand is primarily governed by strong temporal structure: hour of day, working-day status, and long-term calendar effects define the fundamental shape of usage. Weather conditions such as temperature, humidity, and windspeed consistently influence demand, but act mainly as modifiers rather than core drivers.
From a modelling perspective, linear regression with a Box–Cox transformation provides a strong and interpretable baseline, capturing the main directional effects of weather and seasonality. Negative Binomial regression improves upon the Poisson model by addressing overdispersion, but remains limited by its parametric form and struggles with extreme peak-demand periods. In contrast, the tuned Random Forest model substantially outperforms all parametric approaches, achieving the lowest RMSLE by effectively learning nonlinear interactions and sharp demand spikes associated with commuting behavior.
Overall, the results suggest that accurate demand forecasting requires models that can accommodate both strong periodic structure and context-dependent nonlinear effects. For operational use, the tuned Random Forest provides a reliable forecasting tool, while simpler parametric models remain valuable for interpretation and understanding underlying demand drivers.